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Abstract 

We propose a method with better predictions at extreme values than 
the standard method of Kriging. We construct our predictor in two ways: 
by penalizing the mean squared error through conditional bias and by 
penalizing the conditional likelihood at the target function value. Our 
prediction exhibits robustness to the model mismatch in the covariance 
parameters, a desirable feature for computer simulations with a restricted 
number of data points. Applications on several functions show that our 
predictor is robust to the non-Gaussianity of the function. 


1 Introduction 

In many fields of engineering and science, computer experiments have become 
an essential tool in studying physical processes such as the subsurface of the 
earth, aerodynamic forces on bridge decks, and channel network flow. These 
experiments can be thought of as functions: given a set of input variables in a 
hxed domain the computer experiment returns the output, which can be a single 
value, a vector, or even a function. These experiments are usually determinis¬ 
tic, that is if we run the experiment with the same set of input variables, the 
output is identical. For more discussions of problems and examples in computer 
experiments, see Sacks et al. [H] and Koehler and Owen [7]. 

Kriging is a popular way to build metamodels in computer experiments. 
The method was initially proposed by D.G. Krige [S], and improved by G. 
Matheron m- Kriging exactly interpolates the experimental data and produces 
predictions at unobserved inputs. The method also generates credible intervals 
which represent the uncertainty of the prediction. Stein m and Switzer m 
give summaries and in-depth discussions of Kriging. 

However, there are several limitations of Kriging. First of all, the Kriging 
prediction depends on the covariance hyperparameters that are usually unknown 
and need to be estimated. The variability of the predicted process highly de¬ 
pends on the hyperparameters, and the likelihood of the hyperparameters are 
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usually computationally expensive to compute and could have many local max¬ 
ima. There have been several approaches to stabilize the estimation of the 
hyperparameters, such as Covariance Tapering by Kaufman et al. [B] and Pe¬ 
nalized Kriging by Li and Sudjianto [9]. We would like to find a predictor that 
is less affected by the hyperparameters. 

Secondly, the Kriging prediction depends on the mean function that we 
need to specify before looking at the data. In Kriging, there is a “regression 
effect”, in which the predictions are pulled towards the mean function. This 
comes from minimizing the overall mean squared prediction error, and may 
give bad predictions at extreme function values. Conditional Bias-Penalized 
Kriging (CBPK) by Seo [TS] suggests minimizing the mean squared error plus 
the squared conditional bias to improve the performance at the extreme values. 
Furthermore, if there is a model mismatch, for instance if the mean function 
is assumed to be zero but actually it is a linear combination of input values, 
the predictions can be poor. Limit Kriging by Joseph [4] and Blind Kriging by 
Joseph mitigate this problem. 

In this paper, we propose a new prediction method which we call Single 
Nugget Kriging (SiNK). In section [2l we briefly introduce Kriging. In section |3l 
we discuss conditioning the likelihood at the target, a fundamental idea of the 
SiNK. In section SJ we define SiNK, and show that it gives smaller mean squared 
prediction error than usual Kriging when the function value is far from the mean 
function. In other words, SiNK is robust to misspecifying the mean function or 
covariance hyperparameters. In section[5l we compare the performance of SiNK 
to the performance of usual Kriging and Limit Kriging in several numerical 
experiments. 


2 Kriging 

Kriging, or Gaussian Process Regression, treats the deterministic function /(x) 
as a realization of a one-dimensional random field 

K(x) = m(x) -I- Z{'x.) 

where x G K.'^, m(x) is a deterministic mean function, and Z(x.) is a stationary 
Gaussian process with mean zero and covariance function 

There are three widely used Kriging models based on the mean function. 
When the mean function is a known function, it is called Simple Kriging, and 
when the function is an unknown constant /3, it is called Ordinary Kriging. 
When the mean function is a linear combination of known functions fo, ■ ■ ■, fp 
but coefficients l3o,...,l3p are unknown, namely m(x) = it is 

called Universal Kriging. 

For the covariance function, stationary covariance functions that are tensor 
products of one-dimensional kernels are popular. Let Cg [—1,1] be a 
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covariance kernel with length-scale parameter 0. Let 

K{k, y) = a^C{h) = <7^ n n 


i=i 


j=i 


where h = x — y and cr^ and {9i,..., 9d) are estimated from the data. Matern 
covariance kernels [10] are defined as 


a.fl(d) 





where is the modified Bessel function of the second kind. Matern covari¬ 

ance kernels are one of the most commonly used kernels in practice because the 
smoothness of its process, defined in terms of its mean square differentiability, 
can be parametrized through v. 

For high dimensional functions, isotropic covariances 


iL(x,y) = cr^Cedlhll) = cr^Ci 


are often used, where || • || is the Euclidean norm. If there is a measurement error 
or noise in the function, then adding a nugget effect handles the discontinuity 
in the function, namely 


d 

K{x,y) = a'^C{h) = H ^s,i\hj\) + T%{h) 

1=1 

where > 0 is a parameter and Ip is the indicator function of the set {0} C 
Throughout the paper, we only consider deterministic computer experiments 
and we will use the model with a known (or estimated) constant mean /3 for 
simplicity. The simplification of the mean function to a constant does not affect 
predictive performance in general; see Sacks et al. m. We assume that the 
hyperparameters of the covariance function are known (or estimated from the 
data), and we will focus on the prediction at a new point Xp. 

Now suppose we observe y = (Y (xi), ... ,Y (x„)), and let K = (Kij) be the 
nx n covariance matrix of y, /c(xp,Xo) be the variance of F(xo), and k(xo) be 
the covariance vector between y and Y (xp). In a matrix form, 

fc(xp,xo) k(xp)^\ 
k(xp) K J ■ 

Let 1 be the n-length vector of all ones. Then, 

r(xo)|r(X)=y -iV(m,s^) 



m = P + k{Ko)'^K ^(y-/31), and 
= fc(xo, Xp) - k(xo)^iv:“^k(xp). 


where 
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That is, the conditional distribution of T(xo) given y is N{m, s^). The Simple 
Kriging predictor is defined by the conditional mean 

^k(xo) =E[y(xo)|y] =/3 + k(xo)'^itr"^(y-/31). 

The Kriging predictor is also the Best Linear Unbiased Predictor(BLUP) 
that minimizes the mean squared prediction error (MSPE). Specifically, for 
Simple Kriging, the linear unbiased predictor K(xo) =13 + A^(y — /31) that 
minimizes 


E[(y(xo)-f(xo))2] 
with respect to A is the Simple Kriging predictor. 


3 Conditional likelihood at the target and con¬ 
ditional bias 

In this section, we investigate the idea of maximizing the conditional likelihood 
given the target function value, which is the supporting idea of the SiNK. We 
also define a class of predictors by generalizing CBPK. 

3.1 Conditional likelihood at the target 

Let’s formulate the prediction problem as an estimation problem. Instead of 
conditioning by the observed function values, we condition by the unknown 
function value at the target point and compute the likelihood. We easily find 
that 


y(Af)| y(xo) = j/o ~ iV(m, K), where (la) 

m = ,51 + fc(xo,xo)“^(yo -/3)k(xo) and (Ib) 

.K = K: - A:(xo,Xo)“^k(xo)k(xo)'^. (Ic) 

Now the conditional mean is a vector and the conditional variance is a matrix. 
The conditional log likelihood is 

Kyo) = -^(y - w)^iL“^(y - to) + constant. (2) 


Note that the maximizer of the conditional likelihood with respect to yo with 
penalty —{yo — /3)^/(2fc(xo, xq)), which is the maximum a posteriori estimate 
of yo with the prior distribution yo ~ A^(/3, fc(xo, Xq)), is the Simple Kriging 
predictor. However, the maximizer of the conditional likelihood without penalty 
(CMLE) is 


Acmle(xo) = /3 + 


fc(xo,Xo) 

k(xo)'^Kr-ik(xo) 


k(xo)^Kr \y-l31). 
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The derivation is in the appendix, section El Let us define 


P = p{^o) 


k{xoV K-^k{xo) 


fc(xo,Xo) 


Then p(xo)^ is the variance explained by conditioning divided by the marginal 
variance of yo. The quantity p(xo) always lies in [0,1], and can be understood 
as the correlation between the target function value and the data. The CMLE 
is obtained by inflating the residual term of the Simple Kriging predictor by 
l/p(xo)^- 

3.2 Conditional Bias 

The CMLE is also unbiased in the sense that E[Lcmle(xo)] = /?■ In addition, 
i^CMLE(xo) is conditionally unbiased, namely 

E[Lcmle(xo)|E(xo) = yo] = yo- 

However, for Simple Kriging, we have 

E[Lk(xo)|r(xo) = yo]= 13 + {yo - 

= /3 + p ( xo )^(?/ o -/ 3 ) T^yo 

so that Ik(xo) is conditionally biased. We can expect that for a given yo which 
is far from the prior mean, the performance of standard Kriging could be worse 
than the performance of CMLE. 

3.3 Conditional Bias-Penalized Kriging 

Conditional Bias-Penalized Kriging (CBPK) is defined as the linear unbiased 
predictor Y (xq) = /3 -I- A^(y — /31) that minimizes the MSPE plus a multiple of 
squared conditional bias (CB) 

E[(yo-P(xo))^]+®[(2/o-E[F(xo)|?/o])^] (for some J > 0) (3) 

with respect to A. Seo [15] suggests that we use 5 = 1, which leads to the 
predictor 


Lcbpk(xo) = /3 + 


2fc(xo,xo) 


fc(xo,xo) -I 
2 

1 -h p{xoY 


k(xo)'^Kr-ik(xo) 


k(xo) K (y - ^1) 


k(xo) K (y - pi). 


We observe that it is again a predictor with an inflated residual term. Dif¬ 
ferent choices of 5 in ([3]) will lead to different predictors. If 5 = 0, ([3]) is the 
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objective for Simple Kriging, and thus the minimizer Icbpk(xo) is the Sim¬ 
ple Kriging predictor. If d —>■ oo, the minimizing predictor is the CMLE. This 
matches with the fact that the CMLE is conditionally unbiased. 

The main question when using a CBPK is: which ratio between MSPE and 
CB should we use? We seek an automatic way to choose 6 instead of simply 
using (5 = 1 or applying a cross-validation-style approach. We suggest varying 
the ratio spatially, in other words, using an appropriate function of xq as S in 
the following section. Eor any nonnegative S, the generalized CBPK predictor 
for a constant mean model of the form 

y(xo) = /3 -f u>(xo)k(xo)^K:-i(y - /31) 

where rc(xo) G [1, l/p(xo)^]. For every nonnegative 5, there is a corresponding 
w{xo) G [1, l/p(xo)^]. See appendix section [B] for details. 


4 Single Nugget Kriging 

In this section, we define the Single Nugget Kriging and discuss its properties. 

4.1 Definition of SiNK 

Definition 4.1. The Single Nugget Kriging (SiNK) predictor is defined as 

IsiNK(xo) =13+ -^k(xo)'^K"^(y - /31) 

p(xo) 

which is the maximizer of the conditional likelihood given K(xo) = yo with 
penalty 

f ^ _ jyo - p(xo) 

2fc(xo, xo) (1 -k p(xo))' 

That is, the implicit prior distribution on y^ is y^ ^ ?V(/3, fc(xo, Xo)(I-l-l//9(xo)). 

SiNK is defined as the maximum a posteriori estimator with a prior distri¬ 
bution on K(xo). We inflate the prior variance only at Xq by the amount of 
uncertainty measured by p, to reduce the dependency on the prior. It is equiv¬ 
alent to assuming an independent Gaussian noise only on K(xo), so we call the 
method Single Nugget Kriging. 


Remark. The SiNK predictor is the CBPK predictor with 6 = I/p(xo); it is the 
linear unbiased predictor Y (xq) = /3 -k A^(y — /31) where A is the solution of the 
optimization problem 


minimize IE[(yo — l^(xo))^] + 


I 

p(xo) 


m.yo 


E[r(xo)|yo])"]. 
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Verifications of Definition KT\ and Remark KT\ are in appendix sections E] 
and [B] respectively. As mentioned in section [3l the ratio S is now a function 
of Xq. The conditional bias penalty is larger when we have less information on 
the target function value. Penalizing by the conditional bias by an appropriate 
multiple of the conditional bias squared will improve performance at extreme 
values. The rationale of using 6 = l/p(xo) will be discussed in section ESI 


4.2 One-point case 


To illustrate the difference among the predictors, we consider the case when 
there is only one observation. Let Vo and Vi be two output values from a 
function. We observe Vi = yi and want to predict Yq. The model in this case 
consists of 


E 



and Var 




where p > 0. The Simple Kriging predictor and the CMLE are 


Yk = /3 + p{yi - P) and 

VcMLE =/3 + - (j/i —/3) ■ (4) 

P 

If we have p close to zero, which is the case when we have little information 
on Vq, then both predictors have problems. The Simple Kriging predictor will 
depend mostly on the prior mean /3, and the CMLE predictor will have a large 
variance if the true function value is far from the prior mean. However, the 
SiNK predictor is 


TsiNK = /3 + -(yi - /3) = 2/1 
P 

which does not depend on any parameters. If one wants to rely more on the 
data than the prior mean /3, SiNK is preferable to Simple Kriging. Intuitively, 
not only when n = 1 but also when n > I, SiNK will be more robust to the 
misspecified mean and covariance than usual Kriging. 


4.3 Properties 

The main feature of SiNK is its stability which will be represented as bound¬ 
edness and localness in this section. The natural question that arises may be 
the uniqueness of a predictor with these properties. Theorem 14.31 shows that 
the SiNK predictor is the unique predictor with both of these properties, in the 
class of generalized CBPK predictors with MSPE-CB ratio 5 as a function of 
Pi^o)- 

The following proposition shows that if the covariance function is stationary, 
then the SiNK predictor is bounded. This is not the case for the CMLE because 
it is unbounded as p{xo) approaches 0. For instance, in the one-point case (|4]), 
VcMLE diverges as p -)> 0. 
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Proposition 4.1 (Boundedness). 

|PsiNK(xo) - /3| < \/k{xo,Xo)^J (y - - /?1) (5) 

Thus, if the covariance function is stationary, then 

sup |lsiNK(xo)| < oo. (6) 

xoGR"^ 

Proof. By the Cauchy-Schwartz inequality, 

I^SiNKCxo) - ^1 = -^|k(xo)'^it:”^(y - /31)| 
p(xo) 

< ^^^\/k(xo)'^it:-ik(xo)y(y - l31)'^K-^{y - /31) 

= A/fc(xo,xo)Y^(y - /31)^it:-i(y - /31) 

and equality holds when if“^/^k(xo) and — j3T) are parallel. If the 

covariance function is stationary, then the right hand side of (HI) does not depend 
on xq, thus ([ 6 |) holds. □ 

For a predictor with inflated residual of Simple Kriging predictor to be 
bounded, the maximum amount of inflation is order of l/p(xo). Roughly speak¬ 
ing, SiNK is the predictor with maximum inflation of the residual term that 
satisfies boundedness. 

Now let Jk be a set of points that have different distances from observations 
in /c’th coordinate, namely 


Jk ■■= {xo I |(xo - Xj)k\ |(xo - xi)k\ for all j 7 ^ l,j,l G {1,2,.. .,n}} (7) 

where fc G (1,2, • • • , d}. In Proposition 14.21 and Theorem 14..11 we assume that 
the new point Xg is in Jk to break the ties; we remove a measure zero set to 
simplify the argument. Also, let us define the neighborhood of an observation 
Xj for j G (1, 2,..., n} as 

R(xj) := (xo I A:(xo,Xj) > K{xo,ki) VZ j,l G ( 1 , 2 ,..., n}}. ( 8 ) 

That is, if Xg G B{xj), then Xj is the closest observation to xg in terms of 
covariance. 

Proposition 4.2 (Localness). Suppose that the covariance function is a tensor 
product of stationary kernels with length scale parameter 9 = {9i,..., 9d). Then 

lim sup |F(xg) - r(xj)| = 0 

xoGB(xj)nJfc 

where Jk and B{:x.j) are sets of points defined in (l7|) and (| 8 ]) respectively. 
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Proposition 14.21 shows that as 9k —>■ 0, if xj is the closest observation (in 
fc’th coordinate) to Xq, then the SiNK predictor 4^(xo) converges to Y{xj). In 
the following theorem, we show that the SiNK predictor is the only predictor 
that satisfies localness, in the class of generalized CBPK predictors. Note that 
as 9k —>■ 0, the Simple Kriging predictor converges to the prior mean p. 

Theorem 4.3 (Uniqueness). Consider a conditional biased penalized kriging 
predictor 


y(xo) =/3 + w;(xo)k(xo)^K: ^(y-/31) 

such that the covariance function is a tensor product of stationary kernels with 
length seale parameter 9 = {9i,... ,9^), and w{x.q) € [1, l/p(xo)^] is a contin¬ 
uous function of pfx-o)- Suppose that xi,..., x„ S Jj, (O. If there exists a 
k € {1,2,... ,d} such that 

lim sup |F(xo) - r(xj)| = 0 (9) 

®'=->0xoGB(xj)nJfc 

holds where Jk and are sets of points defined in ([7]) and ([8]) respectively, 

then w{pfx.Q)) = l/p(xo), i.e. U(xo) is the SiNK predictor. 

The proof of Proposition 14.21 and Theorem 14.31 is given in the appendix, sec¬ 
tion El Restricting r(;(xo) to be a function of p(xo) enables us to guarantee that 
w{xo) G [1, l/p(xo)^]. For example, w(xo) = l/p(xo) is always in [1, l/p(xo)^]. 
Another example for necessity of this condition is Limit Kriging (Joseph [4]) 
where the predictor has w(xo) = l/(k(xo)'^K~^l). The Limit Kriging pre¬ 
dictor has the localness property, but is not guaranteed to be a CBPK with 
nonnegative ratio J, which means we cannot guarantee better performance at 
extreme values. 

Figure [T] illustrates the property of SiNK and the difference to Ordinary 
Kriging. The function used in this figure is the 2-dimensional Zakharov function 
in [ 0 , 1 ]^, which is 

d / d \2/d 

f{x) = Y,x'i + [Yl O.bixij -I- f O.SzXi j (10) 

i=l ^ i—1 ^ ^ 2=1 ^ 

where d = 2, and the input points are 4 midpoints of the edges of a unit square. 
We fitted Ordinary Kriging and SiNK with an estimated constant mean and 
tensor product Matern 5/2 covariance. For 9i = 92 = 1, the predictions are 
quite similar because p(xo) ~ 1 for all Xq G [0,1]^. However, when 9i = 02 
are close to zero, we observe significant differences between the two predictions. 
We also observe the localness property of SiNK. The p(xo) are close to zero for 
most of the plotted points, and thus the Ordinary Kriging predictor is close to 
the estimated constant mean for points far from the observations. The SiNK 
predictor uses the function value of the observation that is the closest to the 
target point. 
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(a) e = 1. (b) 6» = 0.05. 


Figure 1: Illustration of the difference between Ordinary Kriging and SiNK. 
Gray points are the true function values, black points are the observed function 
values, cyan points are the Ordinary Kriging prediction, brown points are the 
SiNK prediction. 


The localness property of SiNK is also related to the fact that the SiNK 
prediction at Xq only depends on the ratios of the correlations with observed 
function values. For instance, suppose that we predict at another point Xq with 
covariance vector k(xg) = ck(xo), where c is in (0,1). Then 


FsiNK(Xo) = l3 + 


k(x'f -/31) 

\/k(xo)'^^!:-ik(x;)) 


= FsiNK(xo). 


Thus, the SiNK prediction at Xq is the same as the prediction at Xg. However, 
the Simple Kriging prediction is shrunk to /3 by a factor of c. Thus, even if Xg 
is far away from inputs, only the ratios of the correlation determine the SiNK 
prediction. In other words, SiNK does not automatically converge to the prior 
mean /3 as k(xg) — ^ 0, for instance if one of the 9j —>■ 0. 

In practice, even though the prediction is theoretically well bounded, divid¬ 
ing by p(xg) can be numerically unstable when p is close to zero. A practical 
fix is to use 

hsiNK.e(xo) = /3 H-—^k(xg)'^(y - /?!) (II) 

max(p(xo), e) 

for a small e. We use e = 10“^ in our numerical work. A larger e would protect 
from bad estimators of length-scale parameters that we did not encounter in 
our numerical experiments. 


4.4 Mean squared prediction error at extreme values 

Since the Simple Kriging predictor is the BLUP, the SiNK predictor has larger 
MSPE than the Simple Kriging predictor. However, Propositon l4.4l tells us that 

















SINGLE NUGGET KRIGING 


11 


SiNK will be only slightly inferior; the ratio of MSPEs is bounded. 

Proposition 4.4. 

E[(FsiNK(xo) - P(xo))2] = ^-p|^E[(yK(xo) - r(xo))^] 


That is, the RMSPE of SiNK is at most times larger than the RMSPE of 
Kriging. 

Proof. From the conditional distribution of y given F(xo) = yo (dUl), 


E[(bk(xo)-F(xo))2|y(xo)=2/o] 

= k(x.)-A-‘k(x.) - (W^oFA-'k(x„))^ ^ _ 

fc(Xo,Xo) 

E[ (bsiNK(xo) - r(xo))2 I y(xo) = yo ] 


kfxoyK ^k(xo) 
fc(xo,xo) 


= fc(xo,xo) - k(xo) K k(xo) + (yo -/3) ~ U --j 


/k(xo)^i^ ik(xo)Y 


( 12 ) 


Now since yo ^ N(/3, fc(xo, Xq)), 

E[(yK(xo) - Y (xo))^] = fc(xo, Xq) - k(xo)'^i^“^k(xo) 


and hnally 

E[(hsiNK(xo) - Y (xo))^] = 2fc(xo, Xo) - 2^ A:(xo, Xo)k(xo)^iir-ik(xo) 

= 1 + p(x„) Wk(xo)-F(xo))^]. 


by the dehnition of p(xo). □ 

Here we show that SiNK has improved performance at extreme values. This 
can be represented in two ways; conditioning on a single extreme value of Y (ccp) 
and conditioning on a region of extreme T(a;o) values. 

Proposition 4.5. If 


1 

o 


/ (l + p(xo))2 

A/fc(xo,Xo) 


/ (l + p(xo))2-l 


holds, then 

E[ (TsiNK(xo) - K(xo))2 I y(xo) = yo ] < E[ (Fk(xo) - T(xo))" | y(xo) = yo ]. 


Proof. Directly follows from m- 


□ 
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Proposition 4.6. Let 4>{-) and $(•) he the density function and distribution 
function of the standard normal distribution respectively. Let Z{xq) = |(y(xo) — 
/3)/(\/fc(xo,xo))|. For M > 0, */p(xo) > -1 + + (1 - 4>(M))/(M(/)(M)), 

then 


E 


(^SiNK(xo) - F(xo))^|Z(xo) > M 


< E 


(yK(xo) - F(xo))^|2'(xo) > M 


Proof. Let p = p(xo). From (IT^ . 

E[(Fk(xo) - F(xo))^|Z(xo)] = fc(xo,Xo)(p^ - / + 2'(xo)^(l - p^f) and 
lE[(FsiNK(xo) - r(xo))^|Z(xo)] = fc(xo,Xo)(l - p^ + Z(xo)^(l - pf). 


Using 


E[Z(xo) 2 |Z(xo) >M] 


1 

1 - 4>(M) 



+ 1 - $(M) 
1 - 4>(M) 


we get the inequality for p > — 1 + + (1 — $(M))/(M(^(M)). □ 

Figure [5] shows the relation between p(xo) and the critical z-score or the 
threshold M for z-score. The ratio of the region-conditional mean squared 
prediction error 


CMSPEsiNK 

CMSPEk 


E 


(^SiNK(xo) - F(xo))^|Z(xo) > M 


E (yK(xo)-r(xo))2|Z(xo)>M 


decreases as the threshold M increases. 


(13) 


5 Numerical experiments 

For numerical simulations, we used the DiceKriging package in R by O. Roustant 
et al. m- We fit the constant mean model for Ordinary Kriging and SiNK, with 
the maximum likelihood estimator of the constant mean /3 = {1^ K~^ 

For the covariance function, we used tensor products of Matern z/ = 5/2 kernels 
with maximum likelihood estimators of the length-scale parameters Oi,... ,0d, 
unless specified otherwise. We used e = 10“^ in equation (HH. 

To measure the performance of a predictor, we computed the empirical in¬ 
tegrated squared error (EISE) 

. riT 

y^(E(xtest,j) — Y{x.test,j)) ■ 


with an independent set of nr test points. 
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(a) The level curve for the mean squared error, (b) Contour plot of CMSPEsink/CMSPEk 
SINK outperforms Simple Kriging in the re- m- The ratio decreases as the threshold M 
gion above and to the right of the given curves, increases. 

Figure 2: Relation between p(xo) and z-score. 


5.1 Gaussian process 

We generated a realization of a 7 dimensional Gaussian process with zero mean 
and Matern covariance with length-scale hyperparameters 0 = (1,1,1,1,1,1,1) 
and stationary variance /c(x, x) = tr^ = 1. The observations were 100 points 
i.i.d. uniform in [0,1]^ and the test points were 2000 points i.i.d. uniform in 

[ 0 , 1 ] 7 . 

To emulate the real world situation where the hyperparameters are unknown, 
we estimated the hyperparameters by maximizing the likelihood. The estimated 
mean was (3 = 0.143, the estimated length-scale hyperparameters were 6 = 
(1.29, 0.92,1.18,1.41, 0.95,0.76,1.32), and the estimated stationary variance was 
= 0.94. The performance comparison between SiNK and Ordinary Kriging 
is in Table [T] We observe that SiNK had slightly inferior EISE, but showed 
better performance at extreme values. 


Table 1: Performance comparison of Ordinary Kriging and SiNK for a realiza¬ 
tion of Gaussian process and piston function. Extreme values are the function 
values with |z-score| > 2. 


Function 

Gaussian Process 

Piston Function 

Number of observations 

100 

14 

B? Ordinary Kriging 

0.818 

0.674 

SiNK 

0.814 

0.711 

Overall EISE Ratio (SiNK/Ordinary) 

1.020 

0.887 

Extreme values EISE Ratio (SiNK/Ordinary) 

0.820 

0.814 
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0 

o ^ 



Ordinaiy Kriging 
■ SiNK 

T--1-r 

5 10 15 20 

Rank 



(a) Prediction at test points with 1% largest (b) Prediction at test points with 1% smallest 
function values. function values. 


Figure 3: Ordinary Kriging and SiNK for a realization of 7-dimensional Gaussian 
process. Rank is the order of the true function values of the test points. 


Figure [3] shows the prediction at test points with extreme function values. 
We first sort the test points by the true function values and see the 1% largest 
and smallest function values. We observe that SiNK reduces the conditional 
bias by inflating the residual term. Differences are small but consistently in the 
right direction. 


5.2 Piston function 


We examined the performance of SiNK in a computer experiment; the piston 
simulation function. The piston simulation function in Zacks |19] models the 
circular motion of a piston within a cylinder. The response C is the time it 
takes to complete one cycle, in seconds. The formula of the function is 


C'(x) = 27 r. 


M 




where 


2k 


PoVo 


kVn 


V= — \ \ +4k-^Ta- A] and A = PoS + 19.62M - 


Tr 


The description of the input variables is in Table [2j 


In computer experiments, the design of inputs is also very important, because 
each experiment is expensive, and a clever design could reduce the approxima¬ 
tion error. Here we adopted Randomized QMC design (Faure sequence base 
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Table 2: Input variables x for the piston function. 


M e [30,60] 

S G [0.005,0.020] 

Vo € [0.002, 0.010] 
k G [1000, 5000] 

Po e [90000,110000] 
Ta G [290, 296] 

To G [340, 360] 


piston weight (kg) 
piston surface area (m?) 
initial gas volume (m®) 
spring coefficient (N/m) 
atmospheric pressure (N/m?) 
ambient temperature (K) 
filling gas temperature (K) 


o True Function Value 
Ordinary Kriging 
■ SiNK 


\ / 


- 1 ~ 

10 15 



Rank 


Rank 


(a) Prediction at the test points with 1% (b) Prediction at the test points with 1% 
largest function values. smallest function values. 


Figure 4: Ordinary Kriging and SiNK for the piston function. Rank is the order 
of the true function values of the test points. 


7) for observations and test points. In Table [1] we see that in this case SiNK 
performs better not only at extreme values but also overall. This result possibly 
comes from non-Gaussianity of piston function; more specifically, the reduction 
of conditional bias may have had a large effect in the test error in this case. 

Again, in Figure|4]the SiNK predictions are better at the test points with ex¬ 
treme function values than the Ordinary Kriging predictions, and the difference 
is significant at the test points with 1% smallest function values. The inflation 
of the residual is consistently in the right direction, and larger than that of the 
Gaussian process example. 

5.3 Other functions 

We fit Ordinary Kriging, Limit Kriging and SiNK for several deterministic func¬ 
tions and compared the performances. The test function codes are from Bing¬ 
ham’s website (Bingham [2]). Table [3] shows the dimension of the function, the 
number of observed points and test points, covariance type, R^, overall EISE 
ratio, and EISE ratio at extreme values for each function. The training points 
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Table 3: Performance comparison among Ordinary Kriging, Limit Kriging and SiNK. 
Matern covariance with = 5/2 and estimated length-scale parameters are used. NaN 


is the case where no function values had |z-score| > 2. 


Function 

Borehole 

Welch 

Piston 

Friedman 

Robot Arm 

Dimension 

8 

20 

7 

5 

8 

Number of training, test points 

32, 5000 

320, 5000 

49, 5000 

50, 5000 

512, 5000 

(Ordinary Kriging) 

0.934 

0.948 

0.962 

0.967 

0.854 

(Limit Kriging) 

0.942 

0.961 

0.968 

0.968 

0.858 

(SiNK) 

0.946 

0.961 

0.967 

0.968 

0.855 

Overall EISE Ratio (Limit/Ordinary) 

0.884 

0.744 

0.843 

0.977 

0.970 

Overall EISE Ratio (SiNK/Ordinary) 

0.819 

0.750 

0.876 

0.991 

0.992 

Extreme values EISE Ratio (Limit/Ordinary) 

0.876 

0.630 

0.828 

NaN 

0.866 

Extreme values EISE Ratio (SiNK/Ordinary) 

0.803 

0.489 

0.834 

NaN 

0.681 


and test points are independent and uniformly distributed in the domain of 
inputs. The number of training points for fitting each function was chosen so 
that the of Ordinary Kriging is roughly 0.95, except for fitting the Robot 
Arm function which is a comparably difficult function to fit with our prediction 
methods. 

We see that for the 5 functions that we consider, SiNK performed better than 
Ordinary Kriging in terms of EISE, and the EISE ratios are even smaller for 
extreme values. Small R^ gains are relevant because large 1 — R^ improvements 
are captured by the EISE ratios. For instance, for the Welch function, the 
SiNK predictions at points with extreme function values (function values such 
that |z-score| > 2) have roughly half EISE of the EISE of Ordinary Kriging 
predictions. In addition, we observe that the performance of Limit Kriging 
and SiNK is very similar in terms of overall EISE. Limit Kriging also shows 
improved performance at extreme values compared to Ordinary Kriging, but 
the improvement is smaller or no different than the improvement of SiNK. For 
the Friedman function, there was not a test point function value which had 
|z-score| larger than 2. This was due to the large estimate of the stationary 
variance = A:(x, x). A suspicious estimate of the stationary variance can be 
found occasionally in practice, but it is not a problem for the prediction because 
all three predictors that we are comparing do not depend on the estimate of . 
See appendix section [E] for details of the functions used in Table [3l 

6 Discussion 

We have presented an alternative to Kriging with improved predictions at the 
extreme values. We first found a link between conditional likelihood at the 
target and CBPK, and used it to define SiNK. In addition, we showed that 
SiNK has a boundedness and a localness property. In numerical experiments, 
we observed that SiNK generally performs better not only at extreme values 
but also in terms of overall integrated squared error. This result is possibly due 
to the non-Gaussianity of the functions used in the examples. 
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Appendix 

A Derivation of the CMLE 

For simplicity, let p = p{xo) = ■ By the Woodbury formula, 


K ^ = (KT - k(xo)fc(xo,xo) ^k(xo)^) ^ 
_ r^-i , K-ik(xo)k(xo)^Kr-i 


^ fc(xo,xo) - k(xo)^Kr ^k(xo)' 


Therefore 


k(xo)'^Kr ik(xo)k(xo)^K ^ _ 1 


k(xo)^.K ^ = k(xo)'^K: ^ -t- 



A:(xo,Xo) - k(xo)'^K-ik(xo) 1 - 

(14) 


Thus, differentiating the conditional log likelihood ([2]) with respect to yo. 



(1 - p2)/c(xo,Xo) 


1 


(y - m)^K-ik(xo) 


from (fT4ll . Solving dl{yo)/dyo = 0 leads to 
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B Generalization of CBPK and Remark 14.11 

Without loss of generality, let /3 = 0. Expanding (jS]), we get 

nivo - A^y)'] + (5E[(2 /o - E[A^y|2/o])'] 

= fc(xo, xq) — 2A^k(xo) + X^KX + (5E[(2/o — A^m)^] 

= fc(xo, xo) - 2A^k(xo) + A^XA + (5^1- ^ ^(^o) \ 

\ /c(xo,xo)y 

This is a quadratic form of A, and the minimizing A can be computed as in m 
by the Woodbury formula. We get 

X = (k + —--k(xo)k(xo)^) (k(xo) + 5k(xo)) 

V fc(xo,xo) J 

For 5 > 0, w(xo) = (S + l)/(Sp^ + 1) G and lim w(xo) = Ijp^- For 

(5—>-oo 

i = Ijp, w{xq) = 1/p which produces the SiNK predictor. 


C Definition of SiNK 

The logarithm of the posterior probability (up to a constant) is 

logp(»|y) = -l(y - rhfk-\y - m) - 


Differentiating with respect to y^, we get 

91ogp(yo|y) 1 


(y - rhyK ^k(xo) - 


dyo 


P (yo - P) 


k{xo,Xo) 1 + pk(xo,Xo) 

1 , N P (2/0-/3) 

--2u7- K k(xo)-——-y-- 

(1 - p^)fc(xo,xo) l + pfc(xo,xo) 


from da. Solving 91ogp(2/o|y)/92/o = 0 leads to 


yo =/3 +-k(xo)^ if ^(y-/31). 
P 


D Proof of Theorem 14.31 and Proposition 14.2 


Proof. Let the stationary variance K{x,x) = a^. Now for a target point Xg G 
B{xj) n Jfc, for I 7 ^ j, 


K{Xq,Xj) 


TT Ce,(|(x; -Xo)^|) 

Mf^ic,.(|(x,-xo).|) 



Cl 

Cl 


= 0 
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Thus we obtain 


1 


N 

A(Xo,Xj) 


k(xo) = ej 


where ej is the j-th unit vector. Noting that x^ G B{x.j) H we have 

lim -^K = In 

where In is the n x n identity matrix. Thus, 

k(xo)^i4:-ik(xo) 1 j 
lim —-^ = lim -and 

9k^Q K{xQ,Xjy Sfc^-o a^K{xQ,XjY 

K-\y - pi) 


lim 

Bk^Q 


= Vj - P- 


K{xo,Xj) 

Now note that 

r(xo) =/3 + u;(p)k(xo)^i^"^(y-/3l) 


= P + w{p)p 


a'^k{xo)^K i(y-/31) 


pa^ 


K{xo,Xj) 


Thus, to satisfy (0, 


lim w(p)p = 1 

Bk^O 


(15) 


is the condition that needs to hold. For the SINK predictor, w{p) = 1/p, so the 
condition holds, and therefore SINK has the localness property and Proposition 
14.21 holds. 

The limit range of p as 0^ —>■ 0 needs to be determined. Note that for fixed 
Xo G B(xj) n Jfe, p —7> 0 as —>■ 0. Now for any 5 G (0,1], let e = 

and Xq = Xj + tOk&k- For all sufficiently small and positive 9k, we have Xq G 
B (xj) n Jfe. Then 

lim = lim TT C'e,((xj - Xo)i) = lim Ce^{e9k) = C'i(e) = S 

Bk^O 0-2 Bk^O 

1—1 

Thus, lim p = 5 for our selection of Xq. For (TTSl) to hold, since ic is a continuous 

Bk^O 

function of p, w{6)6 = 1 must hold for all S G (0,1]. To put it differently, if (0 
holds, then it is the SiNK predictor. □ 


E Test Functions 

E.l Borehole Function 

(Morris et al. [T^ l 

/(X) = _, 

log(r/ru,) (l.5 + 
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The ranges of the eight variables are ■ (0.05, 0.15), r = (100, 50000), 

= (63070, 115600), = (990, 1110), T; = (63.1, 116), Hi = (700, 820), L = 

(1120, 1680), and = (9855,12045). 

E.2 Welch 

(Welch et al. [18]) 

/(x) = + 5(a;4 - X 2 o)‘^ + X 5 + 40xig - 5xi9 

-L X\ 

+ 0.05x2 + 0.08x3 — 0.03x6 + 0.03x7 — O.OOxg — O.Olxio — 0.07xii + 0.25x^3 
— 0.04x14 + 0.06x15 — 0.01x17 — 0.03x18, x s [—0.5,0.5]^°. 

E.3 Friedman 

(Friedman et al. |3]) 

/(x) = 10sin(7rxiX2) + 20(x3 — 0.5)^ + IOX4 + 5x5, x S [0, Ij®. 

E.4 Robot Arm 

(An and Owen [1]) 


/(x) = (M 2 + i,2)0.5^^1jej.g 


4 i 



4 



X = (01,..., 04, Al,. .., A 4 ) G [0, 2^]4 X [0,1]^ 



